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Abstract — Independent component analysis (ICA) is a com- 
putational method for separating a multivariate signal into 
subcomponents assuming the mutual statistical independence 
of the non-Gaussian source signals. The classical Independent 
Components Analysis (ICA) framework usually assumes linear 
combinations of independent sources over the field of real- 
valued numbers TZ. In this paper, we investigate binary ICA 
for OR mixtures (bICA), which can find applications in many 
domains including medical diagnosis, multi-cluster assignment, 
Internet tomography and network resource management. We 
prove that bICA is uniquely identifiable under the disjunctive 
generation model, and propose a deterministic iterative algorithm 
to determine the distribution of the latent random variables and 
the mixing matrix. The inverse problem concerning inferring 
the values of latent variables are also considered along with 
noisy measurements. We conduct an extensive simulation study 
to verify the effectiveness of the propose algorithm and present 
examples of real-world applications where bICA can be applied. 

Index Terms — Boolean functions, clustering methods, graph 
matching, independent component analysis. 

I, Introduction 

Independent component analysis (ICA) is a computational 
method for separating a multivariate signal into additive sub- 
components supposing the mutual statistical independence of 
the non-Gaussian source signals. The classical Independent 
Components Analysis (ICA) framework usually assumes linear 
combinations of independent sources over the field of real- 
valued numbers 1Z. Consider the following generative data 
model where the observations are disjunctive mixtures of 
binary independent sources. Let x = [x\, X2, ■ ■ ■ ,x n ] T be 
a m-dimension binary random vector with joint distribution 
'P(x), which are observable, x is generated from a set of n 



independent binary random variables y = [2/1,2/2, 
follows, 



■ .</» 



(1) 



x i = \J (9ij A?/,), i = l,...,m, 
3=1 
where A is Boolean AND, V is Boolean OR, and g^ is the entry 
in the i'th row and j'th column of an unknown binary mixing 
matrix G. Throughout this paper, we denote by Gi : - and G : j 
the i'th row and j'th column of matrix G respectively. For 
the ease of presentation, we introduce a short-hand notation 
for the above disjunctive model as, 

x = G?® y. 




Fig. 1: Illustration of the OR mixture model 



The relationship between observable variables in x and 
latent binary variables in y can also be represented by an 
undirected bi-partite graph G = (U, V, E), where U = 
{x!,x 2 , ■ ■ • , x m } and V = {y 1 ,y 2 ,-.., y n } (Figure [I}. An 
edge e = (Xi,yj) exists if gij = 1. We will refer to G as the 
binary adjacency matrix of graph G. The key notations used 
in this paper are listed in Table II] 

Consider an m x T matrix X and an n x T matrix Y, 
which are the collection of T realizations of random vector x 
and y respectively. The goal of binary independent component 
analysis with OR mixtures (bICA) is to estimate the distribution 
of the latency random variables y and the binary mixing matrix 
G from X such that X can be decomposed into OR mixtures 
of columns of Y. 

In this paper, we make the following contributions. First, 
we investigate the identifiability of bICA and prove that under 
the disjunctive generation model, the OR mixing is identifiable 
up to permutation. Second, we develop an iterative algorithm 
for bICA that does not make assumptions on the noise model 
or prior distributions of the mixing matrix. Interestingly, the 
approach is shown to be robust under moderate to medium 
XOR noises and insufficient samples. We furthermore consider 
the inverse problem of inferring the values of y given noisy 
observations X and the inferred model. Finally, we present 
two case studies to illustrate how bICA can be used to model 
and solve real-world problems. . 

The rest of the paper is organized as follows. In Section [II] 
we give a brief overview of related work. In Section III several 
important properties of bICA are proved. In Section |IV] we 
elaborate on an iterative procedure to infer bICA and several 
complexity reduction techniques. Formulation and solution to 
the inverse problem with noisy measurements are presented 



TABLE I: Notations 



m, n, T 


the number of observable variables, 
hidden variables and observations 


G 


the bi-partite graph representing the 
observable-hidden variable relationship 


x mxl 


the vector of m binary observations 
from m observable variables 


Jnxl 


the vector of n binary activities 
from n hidden variables 


X mx T 


the collection of T realizations of x 
(observation matrix) 


y nX T 


the collection of T realizations of y 
(activity matrix) 


**Tnxn 


the binary adjacency matrix of graph G 


Plxn 


the probability vector associated with 
n (Bernoulli) hidden variables 



in Section M Effectiveness of the proposed method is eval- 
uated through simulation results in Section [VT] Real-world 
problem domains where bICA can be applied are discussed 
in Section |VII| and followed by conclusion and future work in 
Section Emj 

II. Related work 

Most ICA methods assume linear mixing of continuous 
signals (TJ. A special variant of ICA, called binary ICA 
(BICA), considers boolean mixing (e.g., OR, XOR etc.) of 
binary signals. Existing solutions to BICA mainly differ in 
their assumptions of the binary operator (e.g., OR or XOR), 
the prior distribution of the mixing matrix, noise model, and/or 
hidden causes. 

In 10, Yeredor considers BICA in XOR mixtures and in- 
vestigates the identifiability problem. A deflation algorithm is 
proposed for source separation based on entropy minimization. 
Since XOR is addition in GF(2), BICA in XOR mixtures can 
be viewed as the binary counterpart of classical linear ICA 
problems. In [2 J, the number of independent random sources 
K is assumed to be known. Furthermore, the mixing matrix 
is a K-by-K invertible matrix. Under these constraints, it has 
been proved that the XOR model is invertible and there exist 
a unique transformation matrix to recover the independent 
components up to permutation ambiguity. Though our proof 
of identifiability in this paper is inspired by the approach in 
0, due to the "non-linearity" of OR operations, the notion of 
invertible matrices no longer apply. New proofs and algorithms 
are warranted to unravel the properties of binary OR mixtures. 

In J3), the problem of factorization and de-noise of binary 
data due to independent continuous sources is considered. 
The sources are assumed to be continuous following beta 
distribution in [0, 1]. Conditional on the latent variables, 
the observations follow the independent Bernoulli likelihood 
model with mean vectors taking the form of a linear mixture 
of the latent variables. The mixing coefficients are assumed 
to non-negative and sum to one. A variational EM solution is 



devised to infer the mixing coefficients. A post-process step 
is applied to quantize the recovered "gray-scale" sources into 
binary ones. While the mixing model in can find many 
real world applications, it is not suitable in the case of OR 
mixtures. 

In |4|, a noise-OR model is introduced to model dependency 
among observable random variables using K (known) latent 
factors. A variational inference algorithm is developed. In 
the noise-OR model, the probabilistic dependency between 
observable vectors and latent vectors is modeled via the noise- 
OR conditional distribution. The dimension of the latent vector 
is assumed to be known and less than that of the observable. 

In 0, Wood et al. consider the problem of inferring 
infinite number of hidden causes following the same Bernoulli 
distribution. Observations are generated from a noise-OR 
distribution. Prior of the infinite mixing matrix is modeled as 
the Indian buffet process |6|. Reversible jump Markov chain 
Monte Carlo and Gibbs sampler techniques are applied to 
determine the mixing matrix based on observations. In our 
model, the hidden causes are finite in size, and may follow 
different distribution. Streith et al. [7| study the problem 
of multi-assignment clustering for boolean data, where the 
observations are either drawn from a signal following OR 
mixtures or from a noise component. The key assumption 
made in the work is that the elements of matrix X are 
conditionally independent given the model parameters (as 
opposed to the latent variables). This greatly reduces the 
computational complexity and makes the scheme amenable to 
a gradient descent-based optimization solution. However, this 
assumption is in general invalid. 

There exists a large body of work on blind deconvolution 
with binary sources in the context of wireless communica- 
tion 0, 0. In time-invariant linear channels, the output signal 
x(k) is a convolution of the channel realizations a(k) and the 
input signal s(k), k — 1,2, ... ,K as follows: 



x(k) = Y^ a(l)s(k -l),k = l,...,K, 



(2) 



1=0 



The objective is to recover the input signal s. Both stochastic 
and deterministic approaches have been devised for blind 
deconvolution. As evident from 0, the output signals are 
linear mixtures of the input sources in time, and additionally 
the mixture model follows a specific structure. 

Literature on boolean/binary factor analysis (BFA) is also 
related to our work. The goal of BFA is to decompose a 
binary matrix X mx r into A mxn B nX T with (g) being 
the OR mixture relationship as defined in ([TJ. X in BFA is 
often called an attribute-object matrix providing m-dimension 
attributes of T objects. A and B are the attribute-factor 
and factor-object matrices. All the elements in X, A, and 
B are either or 1. n is defined to be the number of 
underlying factors and is assumed to be considerably smaller 
than the number of objects T, BFA methods aim to find 
a feasible decomposition minimizing n. Frolov et al. study 
the problem of factoring a binary matrix using Hopfield 



Algorithm 


Sources 


Generative model 


Under/Over determined 


Dimension of latent variables 


|2| 


Binary 


Binary XOR 


- 


Known 


|3J 


Continuous 


Linear 


Over 


Known 


141 


Binary 


Noise-OR 


Over 


Known 


® 


Binary 


Noise-OR 


Under 


Infinite 


ID 


Binary 


Binary OR 


Over 


Known 


1 10|, |11| 


Binary 


Binary OR 


Over 


Unknown, try to minimize 


18], |91 


Binary 


Linear 


- 


Known 


bICA 


Binary 


Binary OR 


Under 


Unknown but finite 



TABLE II: Related work comparison chart 



neural networks 1121 . ifTUl . |[T3ll . This approach is based on 
heuristic and do not provide much theoretical insight regarding 
the properties of the resulting decomposition. More recently, 
Belohlavek et al. propose a matrix decomposition method 
utilizing formal concept analysis fffl . The paper claims that 
optimal decomposition with the minimum number of factors 
are those where factors are formal concepts. It is important 
to note that even though BFA assumes the same disjunctive 
mixture model as in our work, the objective is different. While 
BFA tries to find a matrix factorization so that the number 
of factors are minimized, bICA tries to identify independent 
components. One can easily come up an example, where the 
number of independent components (factors) is larger than the 
number of attributes. Since BFA always finds factors no larger 
than the number of attributes, the resulting factors are clearly 
dependent in this case. 

Finally, J5| consider the under-presented case of less ob- 
servations than latent sources with continuous noise, while 
0, 0, iflOl . ifTTI deal with the over-determined case, where 
the number of observation variables are much larger. In this 
work, we consider primarily the under-presented cases that 
we typically encounter in data networks where the number of 
sensors are much smaller and the number of signal sources 
(i.e. users). 

We summarize the aforementioned related work in Table [H] 

III. Properties of bICA 

In this section, we investigate the fundamental properties 
of bICA. In particular, we are interested in the following 
questions: 

• Expressiveness: can any set of binary random variables 
be decomposed into binary independent components us- 
ing OR mixtures? 

• Independence of OR mixtures: for mixtures of in- 
dependent sources, what is the condition that they are 
independent? 

• Identifiability: given a set of binary random variables 
following the bICA data model, is the decomposition 
unique? 

Expressiveness: Expressiveness of OR mixtures is limited. 
This can be shown through an example. Let y\ and y 2 be two 
independent binary random variables with P{y\ = 1) =p^ 
0.5 and P(y 2 = 1) = q ^ 0.5. Let xi = j/i and x 2 = j/i +ys, 
where '+' is addition in the finite field GF(2). It is easy to 



see that x\ and x 2 are correlated since P(x 2 = 1) = P{yi = 
l)P(y 2 = 0) + P( yi = 0)P(y 2 = 1) = q{\ -p) + p{\ - q), 
P(x! = 1) = p, while P(xi = l,a; 2 = 1) = P(y 2 = 
0) = 1 — q. On the other hand, x 2 can not be decomposed 
into an OR mixture of y\ and y 2 . This essentially shows that 
OR mixtures of binary random variables only span a subset 
of multi-variate binary distributions. There exist correlated 
binary random variables (xi,x 2 in this example) that cannot be 
modeled as OR mixtures of independent binary components. 



Independence of mixtures: Now we turn to the second 
question, namely, under what condition are binary random 
variables that follow the OR mixture model independent. In 
general, pairwise independent random variables are not jointly 
independent. Interestingly, we show that pairwise indepen- 
dence implies joint independence for OR mixtures. 

Theorem 1: Let y = [yi,y 2 , . . . , y n ] T denote n statistically 
independent sources in GF(2), the i-th source having 1- 
probability pi . Let x = D <E> y, where D is a m x n matrix 
(with elements in GF(2)). Let rj(x) and C(x) denote the mean 
and covariance (resp.) of x. If: 



1) All elements of ?/(x) are nonzero and not l's (called non- 
degenerate), 

2) C(x) is diagonal, 

Then i) m = n, and ii) D is a permutation matrix. 

Let us first establish the following lemmas. 

Lemma 1: Let u and v be two RVs in GF(2) with 1- 
probabilities p u and p v (resp.), and let w = u V v If u and v 
are independent, non-degenerate (0 < p, q < 1) then w is also 
non-degenerate. 

Proof: Clearly, p w = P(w = 1) = 1 - (1 -p u )(l -p v ). 
Since < p u ,p v < 1> we nave <p w < 1. ■ 

Lemma 2: Consider non-degenerate independent binary 
random variables yi,y 2 ,y3- Then, X\ = y\ V y 2 and x 2 = y 3 
are independent. 

Proof: To prove independence of two binary random 

variables x\ and x 2 , it is sufficient to show P{x\ — l,x 2 — 



1) = P(xi = l)P(x 2 = 1). 

P(xi = l,x 2 = l) = P(yi = l,y2 = i,ya = i) 

+ P(yi = l,w = 0,i/a = l) 

+ P(y 1 = 0,2/2 = l,j/ 3 = l) 

= P( Vl = l)P(|fe = 1)P( V8 = 1) 

+ P( yi = l)P(y 2 = 0)P(y 3 = 1) 

+ P( yi = 0)P(j/ 2 = l)P(y 3 = 1) 

= P(xi = l)P(x 2 = 1) 



(3) 



Similar, we can show the following result. 

Lemma 3: Consider non-degenerate independent binary 
random variables 2/1,2/2,2/3- Then, X\ = 2/1 V 2/2 and x<z = 
2/1 V 2/3 are correlated. 

Now we are in the position to prove Theorem [TJ 

Proof of Theorem [/} We prove by contradiction. The 
essence of the proof is similar to that in [2J. Let us assume 
now that D is a general matrix, and consider any pair Xk and 
xi (k 7^ I) in x. Xk and xi are OR mixtures of respective 
subgroups of the sources, indexed by the 1-s in D k ., and 
Di : , the fc-th and Z-th rows (respectively) of D. These two 
subgroups consist of, in turn, three other subgroups (some of 
which may be empty): 

1) Sub-group 1: Sources common to D k ,. and Di,. . Denote 
the OR mixing of these sources as u; 

2) Sub-group 2: Sources included in Dk,, but excluded from 
Di-. Denote the OR mixing of these sources as v\\ 

3) Sub-group 3: Sources included in £)/ . but excluded from 
Dk,.- Denote the OR mixing of these sources as v 2 . 

In other words, Xk — u V v\ and xi = u V v%. By applying 
Lemma 13] iteratively, we can show that v\ and v 2 are inde- 
pendent and non-degenerate. Furthermore, if u 7^ 0, then u 
is independent of v\ and v 2 . From Lemma [3] we show that 
Xk and Xi are correlated. This contradicts with the condition 
that C(x) is diagonal. This implies that u = 0. Therefore, the 
two rows Dk,: and Di, do not share common sources, or, in 
other words, there is no column j in D such that both Dkj 
and Dij are both 1. There are only m such columns. Thus, 
m = n. Furthermore, D is a permutation matrix. ■ 

Theorem [Tj necessarily implies the following result: 
Corollary 1: Let x = G y for some G and independent 
non-degenerate sources y. Then, if elements of x is non- 
degenerate and pair-wise independent, the elements in x are 
jointly independent. 

Identiflability: Let x = [x\, . . . , x m ] T . Define the set 

n 

Y ( x ) = {y I V ( g% j A y ^ = Xi ^ i = 1 ' • • • ' m J'- 

3=1 

Therefore, 



P(x) = P(yer(x)) = E ye y (X )P(y) 



E 



yer(x) 



nr=i^ 4 (i -Pi) 1 -* 



(4) 



where P(y) is the joint probability of y, and pt — V{yi = 1) 
The last equality is due to the independence among j/i's. 



To see whether y is uniquely identifiable from x, we first 
restrict G such that it has no identical columns, namely, each 
2/y contributes to a unique set of x/s. Otherwise, if G-,i and 
G u j are identical, we can merge yi and j/j by a new component 
corresponding to ytVyj. Under the restriction, we can initialize 
n = 2 m — 1 and G of dimension m x 2 m — 1 with rows 
being all possible n binary values. The G matrix corresponds 
to a complete bipartite graph, where an edge exists between 
any two vertices in U and V, respectively. For a random 
variable yj 6 V, its neighbors in U is given by the non-zero 
entries in G : j. Thus, at most 2 m — 1 independent components 
can be identified. Given the distribution of random variables 
x G {0, 1}"\ 2 m — 1 equations can be obtained from Q. As 
there are at most 2™ — 1 unknowns (i.e., pi,i = 1, . . . , n), the 
probability of yj can be determined if a solution exists. To see 
the solution uniquely exists, we present a constructive proof 
as follows. 

Let g k , k = 1, . . . , 2 m — 1 be a m-dimension binary column 
vector, and the degree of g k , d(g k ) is the number of ones 
in g k - Define the frequency function Tk = V(x = g k ) = 
V{xi — gik, i = l,... m). For each g k , we associate it with 
an independent component y k . The goal is to show that pk = 
P(Vk — 1) can be uniquely decided. Starting from g k with the 
lowest degree, the derivation proceeds to determine p k with 
increasing degree in g k - 

Basis: It is easy to show that Fq — Ylj=i (1 ~Pj)- Since 
Pk's are non-degenerate, T§ > 0. For k, s.t., d(g k ) = 1, we 
have 

2 m -l 

i=i,i#fe 
Therefore, 



Pk 



J~k + Po 



-Pa 



(5) 



Induction: Define g i -< g if g i 7^ g , and VZ, s.t., gu = 1, 
gij = 1. Let Sk be the set of indices i's, s.t., Pi ^ and 
9i ~4 9k' ^* G <^- If ^fc = ^' tnen @ applies. Otherwise, we 
have 

Pk 

= n,vs fc ^j t (i-Pi)x(Pfc+ 

(1 - Pfc) Escs fe: v ieB 9,=9 fc n ie s ^ n ie s-s fc (! - Pi)) 

(1-P fc )ll ie °s fc (l-Pi) x ^ fe+ 

Bas k ,\/ ieB g i =g k lUeBPilUeB-Sk I 1 _ P*)) 



(i-Pfc)E 

="a x ( T i ^ + 



n jeSfc (i-P.o 

Esc5 fc ,v ieB s'^fl'fc n ie s p» n ie s-s fc i 1 - PW 

where Vigb^j indicates the entry-wise 
of gr/s for i E B. Let us define Lk 
Escs,v ieB 9-9, riiei? Pi UieB-s ( l -Pi))- Then < 

•Pfc Ilie,s fc C 1 - P<) - Fo L k 



Pk 



Pa + Pit Yl ieSk ( l ~Pt)- Po^ 



OR 

A 



(6) 



O-^fc 



It is easy to verify that when the j/i's are non-degenerate, 
all the denominators are positive. This proves that a solution 



to Q exists and is unique. However, direct application of 
the construction suffers from several problems. First, all J^'s 
need to be computed from the data, which requires a large 
amount of observations. Second, the property that JF ^ is 
very critical in estimating Pfc's, When Tq is small, it cannot 
be estimated reliably. Third, enumerating Sk for each k is 
computationally prohibitive. 

IV. Inference of binary independent components 

In this section, we first present a motivating example which 
provide the intuition for our inference scheme, and then 
devise an efficient iterative procedure to estimate pi's. The key 
challenge lies in that both G and 'P(y) are unknown. If G is 
given, the problem becomes trivial and can be easily solved 
by directly applying Maximum-likelihood type of methods. 




Fig. 2: A simple mixture model. Hidden components are shown in 
white disks and observable components are shown in black disks. 

A motivating example: Consider a simple mixture model 
with 2 hidden sources and 2 observables as depicted in 
Figure [2] The probability vector of the sources is p — [0.2 0.5] 
with component y\ having the lower probability being one. 
The marginal probabilities of X\ = 1 and x 2 = 1 can be easily 
computed as 0.6 and 0.5, respectively. Let the realizations of 
y\ and 2/2 over ten trials be: 



Y = 









1 








1 


1 


1 



10 
110 



where yu = 1 indicates that source y,; = 1 at the time slot j. 
Y is hidden and unknown. Since G = 



observation matrix: 



1 



we have the 



0110011110 
0110011100 



The objective of bICA is to infer G and p from X. Since 
the number of observables m — 2, we may identify at most 
2 m — 1 = 3 unique sources, say, y 1; y 2 , 2/3. Denote the inferred 
G and p as 



1 1 
1 1 



,P 



[pi f>2 h] , 



In another word, component yi only manifests through X\, 
7/2 through X2, and 7/3 through both x\ and x-i. Clearly, 



realizations in X where x 2 = correspond to time slots when 
sources 2/2 and 2/3 are both zero. In other words, we have 

V(xi = l,x 2 = 0) =pi{l -j3 2 )(l -p 3 ). 

Note that V(x2 = 0) = (l—p2)(l—p3)- Therefore, we have 
P\ — V{x\ = l\x 2 — 0) sa 0.2. The last term is estimated 
from the realization of X where x 2 = 0. Since x\ = 1 if 
iji = 1 or 2/3 = 1, p 3 can be calculated from 

1 - V( Xl = 1) = (1 - pi)(l - pa) ^p 3 = 0.5. 

Similarity, p 2 can be calculated from 

1 - V{x 2 = 1) = (1 - p 2 )(l - h) => P 2 = 0. 

f>2 = implies that y 2 never activates, and thus its associated 
column can be removed from G. 

The basic intuition of the above procedure is by limiting 
our considerations to realizations where some observables are 
zero, we "null" out the effects of components that contribute 
to these observations. This reduces the size of the inference 
problem to be considered. 

The basic algorithm: Motivated by the above example, we 
will consider G to be a m x 2™ adjacent matrix for the 
complete bipartite graph. Furthermore, the columns of G are 
ordered such that g^i = 1 if I A 2 k = 1, for k — 1, . . . , m, 
where A is the bit-wise AND operator. As an example, when 
m = 3 we have: 



G = 



10 10 10 1 
110 11 
1111 



If the active probability of the Z'th component pi = 0, this 
implies the corresponding column G :> z can be removed from 
G. Before proceeding to the details of the algorithm, we first 
present a technical lemma. 

Lemma 4: Consider a set x = \x\, x 2 , ■ ■ ■ TXh-i:Xh] T 
generated by the data model in ([TJ, i.e., 3 binary independent 
sources y, s.t., x = G <E> y. The conditional random vector 
Xx h —o = [xi,x 2 ,---,Xh-i\xh = 0] T corresponds to the 
vector of the first h — 1 elements of x when Xh = 0. Then, 
x x h =o = G' ® y', where G' = G. i,,, 2 h-i (i.e. the first 
2 hl1 columns of G) and V{y\ = 1) = V(yi = 1) for 
l = l,...,2 h -\ 

Proof: We first derive the conditional probability distri- 
bution of the first h — 1 observation variables given Xu = 0, 



= V(xi,X2,- 

2 h -l 
(a) 



,Xh-i I Xh = 0) 

,x h -\ I Xh = 0)V(x h = 0) 



= e ntfc!-*) 1 -* 

yeY(x) i=i 

e n pfa-pi) 1 -" 1 n c 1 -^) 

yi..2*-i e y(*i..h-i) 9h,=0 9h,=1 

Vi = 0,V.g w = 1 

(7) 



(a) is due to (4i. Since V{x h = 0) 



9hl=l 



F{x 1 ,x 2 ,. 


. .,Xh-l 




2 h -l 


E 


no 


y'6^(x l!h _ 


1) «=i 




E 


yi,...,2 h - 1 


eF(x x 


yi = o, V5 


?u = 1 



I x h = 0) 






,?i-i 



Pi), we have Algorithm 1: Incremental binary ICA inference algorithm 

FindBICA (X) 
input : Data matrix X mX T 
init : n = 2 m - 1; 
Ph = 0,ft = l,...,n; 

G = m x (2 m — 1) matrix with rows corresponding all possible 
binary vectors of length m\ 
\ I—2/1 e = the minimum threshold for p/ t to be considered a real 

component; 

1 if m = 1 then 



Clearly, by setting T>{y[ = 1) = T(yi = 1) for 



(8) 



5/1-1 



1,. 

the conditional random vector x 



the above equality holds. In the other word, 



x h =0 



= G' 



for G' = 



G :i i...2'*-1- ■ 

The above lemma establishes that the conditional random 
vector x Xh =o can be represented as an OR mixing of 2 ,l_1 
independent components. Furthermore, the set of the indepen- 
dent components is the same as the first 2 h ~ 1 independent 
components of x (under proper ordering). 



Consider a sub-matrix of data matrix X, X 



the rows correspond to observations of x\,X%, 



T such that x hi = 0. Define X 



('>- 



-i)xT> where 
..,Xh-i for 
-i)xT> which 



i = l,2, 

consists of the first h — 1 rows of X. 

have computed the bICA for data matrices x9 h _ 1 \ xT 



Suppose that we 
and 



4 



we know that X 



(ft-l)xT 



is real- 



independent components, denoted 
1] - V(yi --- 1) for 

(h-l)xT 



X(h-i)xT- From Lemma 

ization of OR mixtures o 

by y°h-f Furthermore, P\yl h -i(l) 

I = 1, ...,2 h ~ l . Clearly, X^ h _ 1 - )xT is realization of OR 

mixtures of 2 h ~ 1 independent components, denoted by y 2 h-i. 

Additionally, it is easy to see that the following holds: 

^2^(0 = 1] 

= 1 - [1 - P(y§H-i(0 = 1)][1 - V(yi+2»-i = 1)] (9) 

= 1 - (1 -Pi)(l -Pj+2h-l)) 

Oh 



where I = 1 

Pi 

Pl+2 h ~ 1 = 
P2 h - 1 + 1 



Therefore, 



P(y°,_ 1 (0 = l) ! 



= 1- 



•p(y 2 h-i(0=i) 



1-^0^-1(0=1)' 

j r (x h =iAx i =oyie[i...h-i]) 



1 = 1,.. 

1 = 2,.. 



n 



l = 1...2 n ,l^2 n 



(1-Pl) 



nh-1 

I Z ! 

o/l-l 

I Z ! 

(10) 

The last equation above holds because realizations of x where 
(Xk = 1 while Xi = 0; Mi e {0, . . . , fe — 1}) are generated 
from OR mixtures of y 2 «>-i's only. 

Let F(A) be the frequency of event A, we have the iterative 
inference algorithm as illustrated in Algorithm [T] 

When the number of observation variables m = 1, there are 
only two possible unique sources, one that can be detected by 
the monitor x\, denoted by [1]; and one that cannot, denoted 
by [0]. Their active probabilities can easily be calculated by 
counting the frequency of [x\ = 1) and (x\ = 0) (lines fTI- 
p). If m > 2, we apply Lemma |4] and (jTOJ to estimate p and 
G through a recursive process. X? m _ 1 \ xT is sampled from 
columns of X that have x m = 0. If X/ m _ 1 ^ xT is an empty 
set (which means x mt = 1, W) then we can associate x m with 
a constantly active component and set the other components' 



9 
10 
11 

12 



else 



pi = T(x! - 0); 
p 2 = F(xi - 1); 



if v 

11 A (m-l)xT - 
Pl...2"»-1 = 
p 2m -l +1 = 

P2 m -!+2...2 



then 

FindBICA (X (m _ 1)xT ); 

1; 

■n =0; 



else 



P1...2— 1 = FindBICA (X^ m _ 1)xT ); 
Pi...2">-i = FindBICA (X (m _ 1)xT ); 



|_ P/+2"-: 
Pam-l + l = 



- 1 _ 1 -Pj ■ 
1-Pi ' 

J r (a: m = lA3:i=0,Vie[l...m-l]) 

n, = i... 2 '"-l,!^2'"-l+l (1-w) ' 



13 for h 



,2 m do 



14 
15 



if (jp h < e) V (p h = 0) then 
I prune pj, and corresponding columns g^; 



16 output: p and G 



probability accordingly (lines 4 - pi. If X < ? m _ 1 - )xT 
empty, we invoke FindBICA on two sub-matrices X?, 



and X 



(m— l)x 



T to determine pi 



as in (10 1 (lines 10 



(m-l)xT 

and p^ 2m _i, then 
L2b. Finally, p^ and 



infer p 2 " 

its corresponding column gh in G are pruned in the final result 

ifp h <e (lines [13] -[B). 

Reducing computation complexity: Let 5(m) be the compu- 
tation time for finding bICA given X mX T- From Algorithm [T] 
we have, 

S(m) = 2S(m~l)+c2 m , 

where c is a constant. It is easy to verify that S (to) = cm2 m . 
Therefore, Algorithm [T] has an exponential computation com- 
plexity with respect to to. This is clearly undesirable for large 
to's. However, we notice that in practice, correlations among 
Xi's exhibit locality, and the G matrix tends to be sparse. 
Instead of using a complete bipartite graph to represent G, the 
degree of vertices in V (or the number of non-zero elements 
in G-.k) tend to be much less than to. In what follows, we 
discuss a few techniques to reduce the computation complexity 
by discovering and taking advantage of the sparsity of G. We 
first establish a few technical lemmas. 

Lemma 5: If Xi and x% are uncorrected, then V{yi = 1) = 

0, V; s.t., gu = 1 and g H = 1. 

Proof: We prove by contradiction. Suppose 31, s.t., gu = 

1, gu = 1, and V{yi = 1) = 0. Denote the Ts by a set L. 
Let u = AizLVh From the assumption, u is non-degenerate. 



Without loss of generality, we can represent Xi and Xk as 

Xi — u V Vi 

Xk = u V V2 

where v\ and v 2 are disjunctions of remaining non-overlapping 
components in Xi and Xk, respectively From Lemma [3] we 
know that Xi and Xk are correlated. This contradicts the 
condition. ■ 

Lemma 6: Consider the conditional random vector 
x Xfc=0 = [xi,x 2 ,--.,Xk-i\xk = 0] T from a set 
x = [xi, X2, • • ■ , Xk-i, Xk] T generated by the data model in 
([TJ. If Xi and Xk are uncorrelated, x Xk= o(i) and x Xk= o(k) 
are uncorrelated. 

Proof: This lemma is a direct consequence of Lemma [4] 
and Lemma [5] ■ 

Lemma [5] implies that pair-wise independence can be used 
to eliminate edges/columns in G. Lemma [6] states that the 
pair-wise independence remains true for conditional vectors. 
Therefore, we can treat the conditional vectors similarly as the 
original ones. 

We also observe that for x = [x\,X 2 , ■ ■ ■ ,Xk-i,Xk] T if (fPTj) 
holds then there does not exist an independent component 
that generates Xk and some of Xj, j = 1,2, .. .k — 1. In 
other words, Xk is generated by a "separate" independent 
component. 

V{x\,x 2 , ■ • .,x k -i,x k ) = V(x 1 ,x 2 , ■ ■ .,Xk-i)P(xk) (11) 

Finally from ((5), we see that V(yk-i{l) = 1) > 
m&x(pi,pi +2 k-i). Note that V(yk-i(l) = 1) is inferred from 
X(h) X t, while the latter two are for XkxT- This property 
allows us to prune the G matrix along with the iterative 
procedure (as opposed to at the very end). 

Now we are in the position to outline our complexity 
reduction techniques. 
Tl For every pair i and k, compute 



Cov(i, k) 



zZt XnXkt 22 1 Xii Z^t Xkt 



T T T 

Let the associated p-value be p(i,k). The basic idea 
of p-value is to use the original paired data (Xi, Yi), 
randomly redefining the pairs to create a new data set 
and compute the associated r-values. The p-value for the 
permutation is proportion of the r-values generated that 
are larger than that from the original data. If p(i, k) > e, 
where e is a small value (e.g., 0.05), we can remove the 
corresponding columns in G and elements in y. 
T2 We can determine the bICA for each sub-vector sepa- 
rately if the following holds, 



V(x x 



i x k ) = V(x x , ..., xi)V{x 1+1 , ...,x k ). 



T3 If the probability of the i'th component of X k xT Pi < e, 
then Vj, s.t., G- : i -< G-j, the probability of the j'th 
component of X k ' X T Pj < £ f° r k' > k. In another 
word, these columns and corresponding components can 
be eliminated. 



From our evaluation study, we find the computation time is on 
the order of seconds for a problem size m — 20 on a regular 
desktop PC. 

V. The Inverse Problem 

Now we have the mixing matrix G mxn and the active 
probabilities 'P(y), given observation X mx x, the inverse 
problem concerns inferring the realizations of the latent vari- 
ables Y nx x- Recall that n is the number of latent variables. 
Denote yi to be the binary variable for the i'th latent variable. 
Let x = G (8> y. We assume that the probability of observing 
X given x depends on their Hamming distance d(x, X) = 

Ei\Xi-Xil and V{x.\X) = pi { *' x \i- Pe ) m - d ^ x \ 

where p e is the error probability of the binary symmetric 
channel. To determine y, we can maximize the posterior 
probability of y given X derived as follows, 



Hy\x} 



(a) 



v{X\y}v{y} 

V{X} 
v{X\y}v{y} 

V{X} 
■p{X,x|y}-p{y} 

nx } 

v{X\x}v{y} 
V{X} 



IlTLlPe 



r{ V i} 

V A X} ' Y 



n^iPVa-Pi) 1 



v{X} 

where x = G <E) y. (a) and (b) are due to the deter- 
ministic relationship between x and y. Recall that Xi = 
Vy=i (dij A j/j ) , -j = 1, . . . , to. With M is a "large enough" 
constant, we can use big-M formulation |[T4l to relax the 
disjunctive set and convert the above relationship between x 
and y into the following two sets of conditions: 



Xj 

M -Xi 



< E|=iftj%, Vi=l,.. 



, m. 
, TO. 



(12) 



Here, since Y^=i9ijVj 

taking log on both sides and introducing additional auxiliary 

variable a,- = IX, - 



< n, we can set M = n. Finally, 

mxiliary 
we have the the following integer 



programming problem: 



max . 



s.t. 



Y^ t" 4 l ° gPe + ( 1 ~ ai ) 1 °g( 1 ~ P p )] 




+ E"=i K 1 - Vj) log (1 - Pj) + yj logpj] 




Xi<^2gijVj, V?: = i,...,to, 




n 

n-Xi^^gijVj, V* = l,...,m, 




j— i 
on ^ -X"» — Xi, Vi = 1, . . . , m, 
a.i > Xi — Xi, Vi = 1, . . . ,m, 
aa,Xi,yj ={0,1}, Vi = l,...,m,j = 1,. 


..,n. 
(13) 



This optimization function can be solved using ILP solvers. 
Note that p e can be thought of the penalty for mismatches 
between x-, and Xj, 



Zero Error Case: If X is perfectly observed, containing no 
noise, we have p e — and a, = Xj — Xi — 0, or equivalently, 
Xi = Xi. The integer programming problem in ( f]"3j ) can now 
be simplified as: 

n 

max . 51 [(1 - 2/j) log (1 - Pj) + y 3 \ogp } } 

V 3=1 



II 



s.t. 



X t <^2gijyj, 

3 = 1 



\/i = 1, . . . ,m, 



(14) 



n ■ X t >^2 9ijVj, Vi = 1, . . . , m, 

3 = 1 

^■ = {0,1}, Vj = l,...,n. 

Clearly, the computation complexity of the zero error case is 
lower compared to (13) . It can also be used in the case where 
prior knowledge regarding the noise level is not available. 

VI. Evaluation 

In this section, we first introduce performance metrics used 
for the evaluation of the proposed method, and then present 
its performance under different network topologies with dif- 
ferent level of observation noises. In evaluating the structural 
errors of bICA, we also make an independent contribution by 
devising a matching algorithm for two bipartite graphs. 

We compare the proposed algorithm with the Multi- 
Assignment Clustering (MAC) algorithm [7]. The source code 
of MAC was obtained from the authors' web site. As shown in 
Table p] none of existing algorithms follows exactly the same 
model and/or constraints as bICA. For instance, MAC assumes 
the knowledge of the dimension of latent variables. Therefore, 
the comparisons bias against our proposed algorithm. 

A. Evaluation metrics 

We denote by p and G the inferred active probability of 
latent variables and the inferred mixing matrix, respectively. 
1) In Degree Error & Structure Error: Let || • ||i be the 



1-norm defined by ||a;||i = 2~27=i 2~2i 



u i3\- 



diag(-) and 



triu(-) denote the main diagonal and the upper triangular 
portion of a matrix, respectively. Two metrics are introduced 
in to evaluate the dissimilarity of G and G. In degree 
error £y is the difference between the true in-degree of G and 
the expected in-degree G. It is computed by taking the sum 
absolute difference between diag(GG T ) and diag(GG ) as 
the following: 



E d 



A 



\diag(GG T ) - diag(GG 



(15) 



The second metric, structure error E s is the sum of absolute 
difference between the upper triangular portion of GG and 
GG , defined as: 



E„ 



\triu(GG 



T\ 



trm(GG T )||i 



(16) 



Since each element of the upper triangular portion of GG 
is a count of the number of hidden causes shared by a pair of 
observable variables, the sum difference is a general measure 
of graph dissimilarity. 



2) Normalized Hamming Distance: This metric indicates 
how accurate the mixing matrix is estimated. It is defined by 
the Hamming distance between G and G divided by its size. 

H * -J-ET-i d H (G. i ,G. i ). (17) 

To estimate H g however, two challenges remain: First, the 
number of inferred independent components may not be 
identical as the ground truth. Second, the order of independent 
components in G and G may be different. 

To solve the first problem, we can either prune G or 
introduce columns into G to equalize the number of com- 
ponents (n = n, where n is the number of columns in G). 
For the second problem, we propose a matching algorithm 
that minimizes the Hamming distance between G and G by 
permuting the corresponding columns in G. 

Structure Matching Problem: A naive matching algorithm 
needs to consider all h\ column permutations of G, and 
chooses the one that has the minimal Hamming distance to G. 
This approach incurs an exponential computation complexity. 
Next, we first formulate the best match as an Integer Linear 
Programming problem. Denote the Hamming distance between 
column G.. t i and G-.j as c^ > 0. Define a permutation matrix 
A nxn with dij — 1 indicating that the i'th column in G is 
matched with the j'th column in G. The problem now is to 
find a permutation matrix such that the total Hamming distance 
between G and G (denoted by d (G, G)) is minimized. We 
can formulate this problem as an ILP as follows: 

n n 

m j, n - EE^ fl y 

i=l 3=1 



^Qij = 1, 



(18) 



J2 a *3 =1 > 

J'=l 

a,ij = {0,1}, Vi,j = l,...,n. 

The constraints ensure the resulting A is a permutation ma- 
trix. This problem can be solved using ILP solvers. However, 
we observe that the ILP is equivalent to a maximum-weight 
bipartite matching problem. In the bipartite graph, the vertices 
are positions of the columns, and the edge weights are the 
Hamming distance of the respective columns. If we consider 
d (G^i,G : .i), the Hamming distance between column G : i 
and G-.^i, to be the "cost" of matching G- i to G :> i, then the 
maximum- weight bipartite matching problem can be solved in 
0(n 3 ) running time lfT31 . where n is the number of vertices. 
The algorithm requires G and G to have the same number of 
columns. 

One greedy solution is to prune G by selecting the top 
n components from G, which have the highest associated 
probabilities pi since they are the most likely true components. 
However, when T is small and/or under large noise, we 
may not have sufficient observations to correctly identify 
components in G with high confidence. As a result, true 
components might have lower active probabilities comparing 



Algorithm 2: Bipartite graph matching algorithm 

MatchPG (G, G, p) 

input : Gmxn, G mX n, Pixa\ (n < n < 2 m ) 



init 



g' 



0; p' lxA = 0; C h , 



0; 



1 for i = 1, . . . , n do 

2 for j = l,...,n do 

3 if Qi = then 

4 j cy = d H (G : ,i,G:,j) x m; 
else 

6 A = BipartiteMatching (C); 

7 for i = 1, . . . , n do 



find j such that ay = 1; 
G-i = G-i; 



Pi =P 3 \ 

n Prune 6': g' = G' i:l ...„; 

12 Prune p'\ p' — p'i,,, n ; 

13 output: G and p' 




Fig. 3: From top to bottom: inferred matrix G with 18 inferred 
components, transformed matrix G with only 10 components re- 
maining (8 noisy ones were removed), original G and difference 
matrix between G and G . Black dot = 1 and white dot = 0. 



to the noise components. To address the problem, we instead 
keep a larger n and introduce n — n artificial components into 
G. These components will be represented by zero columns 
in G. While matching the inferred columns in G to the 
columns in G, clearly an undesirable scenario occurs when we 
accidently match a column in G to an additive zero column in 
G. This happens when an inferred column G-,i is sparse (i.e. 
having a very small Hamming distance to the zero column). 
To avoid the incident, we multiply the cost of matching any 
column in G to a zero column in G by m. This eliminates the 
case in which a column G- : i is matched with a zero column in 
G, since it is more expensive than matching with another non- 
zero column G- j. We can now select the best n candidates in 
G, which yields a reduced mixing matrix G of size m x n, and 
elements in active probability vector p' will also be selected 
accordingly. The solution to the structure matching problem 
is detailed in Algorithm [2] 

In the algorithm, lines [T] - [5] build the input weight matrix 
Cfixn f°i" the bipartite matching algorithm. If G.{ is a zero 
column, Cij will be scaled by m to avoid the matching 
between column G :J ; and G : j (linekl. The bipartite matching 
algorithm finds the optimal permutation matrix A to transform 
G into G that is "closest" to G (lines |6J-[lO|. We are only 
interested in the first n columns of G and p' (as they most 
likely represent the true PUs). Therefore, G and ft are pruned 
in lines [TD-EE] 

As an example, the inferred result of a random network 
with n = m = 10 is given in Figure [3] Non-zero entries and 
zero entries of G, G, and G are shown as black and white 
dots, respectively. The entry-wise difference matrix \G— G | is 
given in the bottom graph. Gray dots in the difference matrix 
indicate identical entries in the inferred G and the original G; 
and black dots indicate different entries (and thus errors in the 
inferred matrix). In this case, only the first row (corresponding 



to the first monitor x±) contains some errors. 

3) Transmission Probability Error: The prediction error in 
the inferred transmission probability of independent users is 
measured by the Kullback-Leibler divergence between two 
probability distributions p and p. Let p' and p' denotes 
the "normalized" p and p (p^ = Pi^2™—iPi), Transmission 
Probability Error is defined as below (the K-L distance): 



p(p',p') = E?=ifti°g(^). 



(19) 



Intuitively, Transmission Probability Error gets larger as the 
predicted probability distribution p deviates more from the real 
distribution p. 

4) Activity Error Ratio: After applying FindBICA in 
Algorithm 111 on the measurement data of length T to obtain G 
and p, realizations of the hidden variables can be computed by 



solving the maximum likelihood estimation problem in (13 1. 
We define 



H„ 



- — V T d H (Y ■ Y ■) 



(20) 



where Y :i is the z'th column of Y . Similar to H g , this metric 
measures how accurately the activity matrix is inferred by 
calculating the ratio between the size of y and the absolute 
difference between y and y. 

B. Experiment Results 

We have implemented the proposed algorithm in Matlab. 
Four network topologies are manually created (Fig. Wa)) rep- 
resenting different scenarios: connected vs. disconnected net- 
work, under-determined (to > n) vs. over-determined network 
(m < n). All experiments are conducted on a workstation with 
an Intel Core 2 Duo T5750@2.00GHz processor and 2GB 
RAM. For each scenario, 50 runs are executed; the results 
presented are the average value and 99% confidence interval. 
To evaluate the robustness of the interference algorithm, two 
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Fig. 4: Experiment result for bICA and MAC with fixed network topologies. Black lines are bICA performance while gray lines are MAC 
performance. The top row shows the four network topologies. The second to fifth rows show the mean Normalized Hamming Distance, 
Transmission Probability Error, In Degree Error, Structure Error and Activity Error Ratio of four topologies as T increases from 50 to 1,000. 
On the last row is mean CPU runtime measured in seconds. Each graph shows experiment results at 2 different levels of noise: 0% and 
10%. Error bars are symmetric, and indicate standard deviation over 50 runs with different initialized seeds. 
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different levels of noise are introduced, i.e. p e = 0% and 10%. 
In our experiment, noise is generated by randomly flipping an 
entry of the observation matrix X with probability p e . 

Evaluation results for bICA and MAC at the two noise 
levels are presented in Fig. ffib), (d), (e), and (f). We do 
not include the results of MAC in Fig. |4jc) since it does 
not infer the active probability for hidden components. From 
Fig. |4jb), we observe at zero noise, bICA converges quickly to 
the ground truth, and G matrix has been accurately estimated 
with only 100 observations. bICA is comparable to or slightly 
outperforms MAC in the first two topologies, and significantly 
outperform MAC on the later two. It appears the MAC is quite 
sensitive to noise even in small structures. In contrast, the 
accuracy of bICA under 10% noise degrades when the number 
of the observations is small but improves significantly as more 
observations are available. Thus, bICA is more resilient to 
noise than MAC. As shown in Fig. Wd) and (e), inference 
errors tend to increase with the same number of observations 
for both schemes as the structure become more complex for 
both schemes. However, bICA only degrades gracefully. 

Fig. |4|f) shows the accuracy of the solution to the inverse 
problem. In this set of experiments, we first determine G and 
p from the measurement data of length T. Then realizations of 
the hidden sources are estimated by solving the MLE problem 
in (\3\ . The predication error is measured by the Activity 
Error Ratio. We see that at the 0% noise level, both methods 
perform quite well. Since the inference of each realization of 
y is independent from the others, increasing T does not help 
improving the accuracy of the inverse problem (though higher 
T gives a better estimation of G and p. Performance of both 
methods degrades as the noise level increases. Noise has two 
effects on the solution to inverse problem. First, the inferred 
mixing matrix and the active probability can be erroneous. 
Second, no maximum likelihood estimator guarantees to give 
the exact result when the problem is under or close to under- 
determined with noisy measurements. To verify the second 
argument, we provide the exact G and p to the MLE formu- 
lation, and observe comparable errors in the inferred y as the 
case where G and p are both inferred by bICA. This implies 
that the main source of errors in the inverse problem comes 
from the under-determined or close to under-determined name 
of the problem. 



TABLE III: Average computation time (in seconds) of bICA and 
MAC over the 4 network topologies, 2 noise levels, and all sample 





Topology 1 


Topology 2 


Topology 3 


Topology 4 


bICA 


0.49 


0.38 


0.38 


0.64 


MAC 


166.5 


289.21 


302.35 


538.47 



Finally, from Fig. |4|g), the computation time of bICA is 
negligible (under 0.5 second in most cases and under 1.5 
seconds in the worst case). For the ease of comparison, we list 



the numerical values in Table III MAC uses a gradient descent 
optimization scheme, it is much more time-consuming. As 
mentioned earlier, the complexity of bICA is a function of m 



and the sparsity of G. In practice, computation time of bICA 
is also a monotonic function of the number of observations 
T and noise levels. As T is getting larger, the storage and 
computation complexity tends to grow. When T is small or 
the noise level is high, the structure inferred may error on the 
higher complexity side, resulting longer computation time. 

VII. Applications 

In this section, we present some case studies on real-world 
application of bICA. In general, bICA can be applied to 
any problem that need identifying hidden source signals from 
binary observations. The proposed method therefore can find 
applications in many domains. In multi-assignment clustering 
Q, where boolean vectorial data can simultaneously belong 
to multiple clusters, the binary data can be modeled as the 
disjunction of the prototypes of all clusters the data item 
belongs to. In medical diagnosis, the symptoms of patients 
are explained as the result of diseases that are not themselves 
directly observable [5|. Multiple diseases can exhibit similar 
symptoms. In the Internet tomography [16|, losses on end-to- 
end paths can be attributed to losses on different segments 
(e.g., edges) of the paths. In all above generic applications, 
the underlying data models can be viewed as disjunctions 
of binary independent components (e.g., membership of a 
cluster, presence of a disease, packet losses on a network edge, 
etc). Now we will introduce in detail two specific network 
applications in which bICA has been effectively applied. 

A. Optimal monitoring for multi-channel wireless networks 

Passive monitoring is a technique where a dedicated set 
of hardware devices called sniffers, or monitors, are used to 
monitor activities in wireless networks. These devices capture 
transmissions of wireless devices or activities of interference 
sources in their vicinity, and store the information in trace files, 
which can be analyzed distributively or at a central location. 
Most operational networks operate over multiple channels, 
while a single radio interface of a sniffer can only monitor one 
channel at a time. Thus, the important question is to decide the 
sniffer-channel assignment to maximize the total information 
(user transmitted packets) collected. 

Network model and optimal monitoring: Consider a system 
of 77i sniffers, and n users, where each user u operates on one 
of K channels, c(u) e K, = {1, . . . ,K}. The users can be 
wireless (mesh) routers, access points or mobile users. At any 
point in time, a sniffer can only monitor packet transmissions 
over a single channel. We represent the relationship between 
users and sniffers using an undirected bi-partite graph G = 
(S, U, E), where S is the set of sniffer nodes and U is the 
set of users. An edge e = (s, u) exists between sniffer s E S 
and user u € U if s can capture the transmission from u. If 
transmissions from a user cannot be captured by any sniffer, 
the user is excluded from G. For every vertex v € U U S, we 
let N(v) denote vertex v's neighbors in G. For users, their 
neighbors are sniffers, and vice versa. We will also refer to 
G as the binary m x n adjacency matrix of graph G. An 
example network with sniffers and users, the corresponding 
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Fig. 5: A sample network scenario with number of sniffers m = 
5, number of users n — 10, its bipartite graph transformation and 
its matrix representation. White circles represent independent users, 
black circles represent sniffers and dashed lines illustrate sniffers' 
coverage range. 



bipartite graph G, and its matrix representation G are given 
in Figure B] 

If we assume that G is known by inspecting packet headers 
information from each sniffers' captured traces, then the trans- 
mission probability of the users p = (p u )ueu are available 
and are assumed to be independent. As mentioned earlier, 
the more complete information can be collected, the easier 
it is for a network administrator to make decisions regarding 
network troubleshooting. We can therefore measure the quality 
of monitoring by the total expected number of active users 
monitored by the sniffers. Our problem now is to find a sniffer 
assignment of sniffers to channels so that the expected number 
of active users monitored is maximized. It can be casted as 
the following integer program: 



max . 



(21) 



Vs e S, 
Vue U, 
Vue U, 

Vw, s, k, 

- 1 if the sniffer is 



s - L Efc=i z s,fc<i, 

Vu < }^ s eN(u) Z s,c(u)i 
Vu < 1, 

y u ,z s ,k g {0,1}, 

where the binary decision variable z s ^ 
assigned to channel k; otherwise. y u is a binary variable 
indicating whether or not user u is monitored, and p u is the 
active probability of user u. 

Network topology inference with binary observations: 

From pT) , it is clear that we need the network and user-level 
information in order to maximize the quality of monitoring. 
However, this information is not always available. We consider 
binary sniffers, or sniffers that can only capture binary infor- 
mation {on or off) regarding the channel activity. Examples 
of such kind of sniffers are energy detection sensors using for 
spectrum sensing. The problem now is to infer the user-sniffer 
relationship (i.e. G) and the active probability of users from 



the observation data (i.e. X). Let x = [x%, x%, . . . , x m ] T be a 
vector of m binary random variables and X be the collection 
of T realizations of x, where xu denotes whether or not sniffer 
Si captures communication activities in its associated channel 



at time slot t. Let y = [yl, y2, . 



,2/n 



be a vector of n binary 



random variables, where yj = 1 if user Uj transmits in its 
associated channel, and yj = otherwise. Sniffer observations 
are thus disjunctive mixtures of user activities. In other words, 
relationship between x and y is x = G <S> y and thus we can 
use bICA to infer G and p. 

WiFi trace collection and evaluation result: We evaluate 
our proposed scheme by data traces collected from the Uni- 
versity of Houston campus wireless network using 21 WiFi 
sniffers deployed in the Philip G. Hall. Over a period of 6 
hours, between 12 p.m. and 6 p.m., each sniffer captured 
approximately 300,000 MAC frames. Altogether, 655 unique 
users are observed operating over three channels. The number 
of users observed on channels 1, 6, 11 are 382, 118, and 
155, respectively. Most users are active less than 1% of the 
time except for a few heavy hitters. User-level information is 
removed leaving only binary channel observation from each 
sniffer. G and p are then inferred using bICA and input 



to the integer program (21 1 to find the best sniffer channel 



assignment that maximize the expected number of active users 
monitored. Obviously, the more accurate the network model 
is inferred, the better the assignment is and the more users 
are monitored. We also vary the result by randomly select a 
subset of sniffers and observe the number of monitored users 
from this set of sniffers. Result is presented in Fig. [6] 
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Fig. 6: Expected number of active users monitored with the number 
of sniffers vary from 5 to 21. 

In Fig. [6] we compare the average of number of active 
users monitored using the inferred G and p, and the ground 



truth. The integer programming problem (21 1 is solved using 



a random rounding procedure on its LP relaxation, which is 
shown to perform very close to the LP upper bound in our 
earlier work ifTTIl . 

Note that most users are active less than 1% and the 



13 



average active probability of users is 0.0014. The system 
consists of 655 unique users, therefore the average number 
of active users monitored is around 1. For comparison, we 
also include a naive scheme (Max) that puts each sniffer to 
its busiest channel. Therefore, Max does not infer or utilize 
any structure information. From Fig. [6] we observe that the 
sniffer-channel assignment scheme with bICA (LPR - BICA) 
performs close to the case when full information is available 
(LPR - USER), and much better than an agnostic scheme 
such as Max (MAX). This demonstrates that bICA can indeed 
recover useful structure information from the observations. 

B. PU separation in cognitive radio networks 

With tremendous growth in wireless services, the demand 
for radio spectrum has significantly increased. However, spec- 
trum resources are scarce and most of them have been already 
licensed to existing operators. Recent studies have shown that 
despite claims of spectral scarcity, the actual licensed spectrum 
remains unoccupied for long periods of time lfl8l . Thus, 
cognitive radio (CR) systems have been proposed JT9), ll20l , 
[21 1 in order to efficiently exploit these spectral holes, in which 
licensed primary users (PUs) are not present. CRs or secondary 
users (SUs) are wireless devices that can intelligently monitor 
and adapt to their environment, hence, they are able to share 
the spectrum with the licensed PUs, operating when the PUs 
are idle. 

One key challenge in CR systems is spectrum sensing, 
i.e., SUs attempt to learn the environment and determine the 
presence and characteristics of PUs. Energy detection is one of 
the most commonly used method for spectrum sensing, where 
the detector computes the energy of the received signals and 
compares it to a certain threshold value to decide whether the 
PU signal is present or not. It has the advantage of short detec- 
tion time but suffers from low accuracy compared to feature- 
based approaches such as cyclostationary detection l20l . l2ll . 
From the prospective of a CR system, it is often insufficient 
to detect PU activities in a single SU's vicinity ("is there 
any PU near me?")- Rather, it is important to determine the 
identity of PUs ("who is there?") as well as the distribution 
of PUs in the field ("where are they?"). We call these issues 
the PU separation problem. Clearly, PU separation is a more 
challenging problem compared to node-level PU detection. 

Solving PU separation problem with bICA: Consider a 
slotted system in which the transmission activities of n PUs 
are modeled as a set of independent binary variables y with 
active probabilities 'P(y). The binary observations due to 
energy detection at the m monitor nodes are modeled as an 
m-dimension binary vector x = [x\, X2, ■ • - , x m ] T with joint 
distribution 'P(x). It is assumed that presence of any active 
PU surrounding of a monitor leads to positive detection. If 
we let a (unknown) binary mixing matrix G represents the 
relationship between the observable variables in x and the 
latent binary variables in y = [j/ 1; y 2 , ■ ■ ■ , y n ] T , then we can 
write x = G <E) y. The PU separation is therefore amenable to 
bICA. 



Inferring PU activities with the inverse problem: Extracting 
multiple PUs activities from the OR mixture observations is a 
challenging but important problem in cognitive radio networks. 
Interesting information, such as the PU channel usage pattern 
can be inferred once Y is available. The SUs will then be 
able to adopt better spectrum sensing and access strategies to 
exploit the spectrum holes more effectively. Now suppose that 
we are given the observation matrix X and already estimated 
the mixing matrix G and the active probabilities "P(y). Solving 
the inverse problem gives the PU activity matrix Y. 

Simulation setup and result: In the simulation, 10 monitors 
and n PUs are deployed in a 500x500 square meter area. 
We fix the sample size T = 10, 000 and vary the number 
of PUs from 5 to 20 to study its impact on the accuracy of 
our method. Locations of PUs are chosen arbitrarily on the 
field. The PUs transmit power levels are fixed at 20mW, the 
noise floor is -95dbm, and the propagation loss factor is 3. 
The SNR detection threshold for the monitors is set to be 
5dB. PUs activities are modeled as a two-stage Markov chain 
with transition probabilities uniformly distributed over [0; 1]. 
A monitor reports the channel occupancy if any detectable 
PU is active. Noise is introduced by randomly flip a bit in 
the observation matrix X from 1 to (and vice versa) at 
probability e. e is set at 0%, 2%, and 5%. Prediction error 
on G and Y over 50 runs for each PU setting are shown in 
Fig.0 
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Fig. 7: Inference error on G and Y at three noise levels. 

As we can see, at noise level 0%, bICA can accurately 
estimate the underlying PU-SU relationship and the hidden 
PU activities for small number of PUs. However, introducing 
noise or having more PUs tend to degrade the performance 
of the inference scheme. The errors in the noisy cases can be 
attributed to the fact that the average PU active probability 
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is around 2%, which is comparable to the noise level. More 
information on this application can be found in l22l . 

VIII. Conclusion 

In this paper, we provided a comprehensive study of the 
binary independent component analysis with OR mixtures. 
Key properties of bICA were established. A computational 
efficient inference algorithm have been devised along with the 
solution to the inverse problem. Compared to MAC, bICA 
is not only faster, but also more accurate and robust against 
noise. We have also demonstrated the use of bICA in two 
network applications, namely, optimal monitoring in multi- 
channel wireless networks and PU separation in cognitive 
radio networks. We believe the methodology devised can be 
useful in many other application domains. 

As future work, we are interested in devising inference 
schemes that can easily incorporate priori knowledge of the 
structure or active probability of latent variables. Also on the 
agenda is to apply bICA to problems in application domains 
beyond wireless networks. 
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